DSDA Data Visualisation Portfolio Assessment

Global Covid-19 Data Visualisation

The aim of this notebook is to analyse the global impact of COVID-19 while practising and consolidating data visualisation techniques in Python.

Import Relevant Libraries

In [39]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go
import math
import os
from datetime import datetime, timedelta
import plotly.offline as pyo
# Set plotly's notebook mode so figures render offline
pyo.init_notebook_mode()

Date Formatting

In [2]:
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d')
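
For comparison, the same `%Y-%m-%d` parsing can be sketched with `pd.to_datetime` and an explicit `format`, which may be useful because pandas 2.0 deprecated the `date_parser` argument of `read_csv`. The inline CSV text and the names `csv_text`, `raw` and `parsed` are made up for illustration; only the column names follow the daily file.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the layout of the daily file.
csv_text = "date,country,active_cases\n2020-02-15,Afghanistan,0.0\n2020-02-16,Afghanistan,0.0\n"

raw = pd.read_csv(io.StringIO(csv_text))
# Parse the date column after loading, using the same format string as `dateparse`.
parsed = raw.assign(date=pd.to_datetime(raw["date"], format="%Y-%m-%d"))

print(parsed.dtypes)
```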

Read the data (csv files) using pandas

Understand what columns are available and the different data types

In [3]:
df_daily = pd.read_csv('COVID_Daily.csv',parse_dates=['date'], date_parser=dateparse)
df_daily.head()
Out[3]:
date country cumulative_total_cases daily_new_cases active_cases cumulative_total_deaths daily_new_deaths
0 2020-02-15 Afghanistan 0.0 NaN 0.0 0.0 NaN
1 2020-02-16 Afghanistan 0.0 NaN 0.0 0.0 NaN
2 2020-02-17 Afghanistan 0.0 NaN 0.0 0.0 NaN
3 2020-02-18 Afghanistan 0.0 NaN 0.0 0.0 NaN
4 2020-02-19 Afghanistan 0.0 NaN 0.0 0.0 NaN
In [4]:
df_daily.columns
Out[4]:
Index(['date', 'country', 'cumulative_total_cases', 'daily_new_cases',
       'active_cases', 'cumulative_total_deaths', 'daily_new_deaths'],
      dtype='object')
In [5]:
df_daily.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170917 entries, 0 to 170916
Data columns (total 7 columns):
date                       170917 non-null datetime64[ns]
country                    170917 non-null object
cumulative_total_cases     170917 non-null float64
daily_new_cases            161237 non-null float64
active_cases               161011 non-null float64
cumulative_total_deaths    164059 non-null float64
daily_new_deaths           145063 non-null float64
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 9.1+ MB
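
The non-null counts printed by `info()` can also be turned into explicit missing-value counts per column. A minimal sketch on a made-up frame (the `toy` data is illustrative and not taken from the CSV):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "daily_new_cases": [np.nan, 5.0, np.nan, 2.0],
    "daily_new_deaths": [np.nan, np.nan, np.nan, 1.0],
})

missing = toy.isna().sum()          # number of NaN values per column
missing_share = toy.isna().mean()   # fraction of NaN values per column
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```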
In [6]:
# df_daily.describe
In [7]:
summary = pd.read_csv('COVID_Summary.csv')
summary.head()
Out[7]:
country continent total_confirmed total_deaths total_recovered active_cases serious_or_critical total_cases_per_1m_population total_deaths_per_1m_population total_tests total_tests_per_1m_population population
0 Afghanistan Asia 176983 7651.0 159744.0 9588.0 1124.0 4378 189.0 909906.0 22511.0 40421365
1 Albania Europe 272885 3487.0 268764.0 634.0 13.0 95001 1214.0 1762808.0 613697.0 2872441
2 Algeria Africa 265511 6870.0 178137.0 80504.0 8.0 5874 152.0 230861.0 5108.0 45199871
3 Andorra Europe 39234 153.0 38377.0 704.0 14.0 506402 1975.0 249838.0 3224715.0 77476
4 Angola Africa 99003 1900.0 96951.0 152.0 NaN 2861 55.0 1473371.0 42575.0 34606502
In [8]:
summary.columns
Out[8]:
Index(['country', 'continent', 'total_confirmed', 'total_deaths',
       'total_recovered', 'active_cases', 'serious_or_critical',
       'total_cases_per_1m_population', 'total_deaths_per_1m_population',
       'total_tests', 'total_tests_per_1m_population', 'population'],
      dtype='object')
In [9]:
summary.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 12 columns):
country                           225 non-null object
continent                         225 non-null object
total_confirmed                   225 non-null int64
total_deaths                      216 non-null float64
total_recovered                   211 non-null float64
active_cases                      212 non-null float64
serious_or_critical               162 non-null float64
total_cases_per_1m_population     225 non-null int64
total_deaths_per_1m_population    216 non-null float64
total_tests                       211 non-null float64
total_tests_per_1m_population     211 non-null float64
population                        225 non-null int64
dtypes: float64(7), int64(3), object(2)
memory usage: 21.2+ KB
In [10]:
# summary.describe

Add the continent column from the summary data to the daily data so that the daily figures can be grouped and filtered by continent

In [11]:
# Map country -> continent once (countries missing from summary become NaN rather than raising)
df_daily['continent'] = df_daily['country'].map(summary.drop_duplicates('country').set_index('country')['continent'])
In [12]:
df_daily.head()
Out[12]:
date country cumulative_total_cases daily_new_cases active_cases cumulative_total_deaths daily_new_deaths continent
0 2020-02-15 Afghanistan 0.0 NaN 0.0 0.0 NaN Asia
1 2020-02-16 Afghanistan 0.0 NaN 0.0 0.0 NaN Asia
2 2020-02-17 Afghanistan 0.0 NaN 0.0 0.0 NaN Asia
3 2020-02-18 Afghanistan 0.0 NaN 0.0 0.0 NaN Asia
4 2020-02-19 Afghanistan 0.0 NaN 0.0 0.0 NaN Asia
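
Before relying on the new column, it may be worth checking that every country in the daily data has a match in the summary data. A minimal sketch using `merge` with `indicator=True`; the frames `daily_toy` and `summary_toy` and the country "Atlantis" are invented for illustration:

```python
import pandas as pd

daily_toy = pd.DataFrame({"country": ["Italy", "Italy", "Atlantis"],
                          "active_cases": [10.0, 12.0, 1.0]})
summary_toy = pd.DataFrame({"country": ["Italy", "Angola"],
                            "continent": ["Europe", "Africa"]})

# Left join keeps every daily row; `_merge` records whether a match was found.
merged = daily_toy.merge(summary_toy[["country", "continent"]],
                         on="country", how="left",
                         validate="many_to_one", indicator=True)

unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())
print(unmatched)
```

An empty `unmatched` list would mean that every daily row received a continent.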

Create a function that adds thousands separators to a number given as a string of digits

In [13]:
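def count(num):
    """Insert a comma after every third digit of `num` (a string of digits), counting from the right."""
    out = ""
    counter = 0
    for n in num[::-1]:  # walk the digits from right to left
        counter += 1
        if counter == 4:  # three digits emitted since the last comma
            counter = 1
            out = "," + out
        out = n + out
    return out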
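
Python's format specification offers the same grouping through the `,` option, which also copes with floats once they are cast to `int`. A small sketch; `format_thousands` is a hypothetical helper name and the "N/A" placeholder for missing values is an assumption:

```python
import math

def format_thousands(value):
    """Return `value` with thousands separators, or "N/A" for missing values."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "N/A"
    return f"{int(value):,}"

print(format_thousands(16777216))
print(format_thousands(7651.0))
print(format_thousands(float("nan")))
```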

Scale Values

In [16]:
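value = list(range(0,25,2))  # colour-bar tick positions: exponents 0, 2, ..., 24
log_scale = (np.exp2(value)).astype(int).astype(str)  # matching counts 2**0 ... 2**24

log_scale = list(map(count, log_scale))  # add thousands separators for the tick labels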
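
To make the link between tick positions and labels explicit: the colour axis is in log2 units, so a tick at position `e` is labelled with the count `2**e`. A short sketch of that mapping (the names `exponents`, `labels` and `position` are illustrative):

```python
import numpy as np

exponents = list(range(0, 25, 2))             # tick positions on the log2 axis
labels = [f"{2 ** e:,}" for e in exponents]   # counts displayed at each tick

# A count of 4,096 lands at position log2(4096) on the colour axis.
position = int(round(float(np.log2(4096))))
print(position, labels[exponents.index(position)])
```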

Create Interactive Choropleth map

In [42]:
active_df = df_daily[['date', 'country', 'active_cases']].dropna().sort_values('date')
active_df = active_df[active_df.active_cases > 0]
active_df['log2(active_cases)'] = np.log2(active_df['active_cases'])
active_df['date'] = active_df['date'].dt.strftime('%m/%d/%Y')

fig = px.choropleth(active_df, locations="country", locationmode='country names',
                    color="log2(active_cases)", hover_name="country", hover_data=['active_cases'],
                    projection="natural earth", animation_frame="date",
                    title='<b>COVID-19 Global Active Cases Over Time</b>',
                    color_continuous_scale="inferno_r", # invert colour scale
                   )
# Values on colour bar
fig.update_layout(coloraxis={"colorbar": {"title":"Active Cases",
                                          "tickvals":value,
                                          "ticktext":log_scale}})
                 

fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1
fig.layout.updatemenus[0].buttons[0].args[1]['transition']['duration'] = 1

#fig.show()
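
The cell above sorts by `date` before converting the column to `%m/%d/%Y` strings. The order matters if, as is assumed here, the animation frames follow the row order of the data: sorting the formatted strings instead would compare the month digits before the year. A sketch with three made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2021-01-01", "2020-12-31", "2020-02-15"]))

# Sort on real datetimes first, then format (the order used in the cell above).
sorted_then_formatted = dates.sort_values().dt.strftime("%m/%d/%Y").tolist()
# Formatting first and sorting the strings puts January 2021 before 2020 dates.
formatted_then_sorted = sorted(dates.dt.strftime("%m/%d/%Y"))

print(sorted_then_formatted)
print(formatted_then_sorted)
```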

Create Pie Chart to analyse GLOBAL statistics

Percentages can hide the true magnitudes: a small slice of the pie can still represent millions of people, so the absolute totals should be read alongside the chart. More details are given in the presentation.

In [40]:
pie1 = go.Pie(labels=['Total Active','Total Recovered', 'Total Deaths'],
               values=[summary.active_cases.sum(), summary.total_recovered.sum(), summary.total_deaths.sum()], 
               title="<b>Total Recovered, Active Coronavirus Cases and Deaths</b>",
               marker=dict(colors=["lightgreen", "lightblue", "red"]),  # "paleblue" is not a valid CSS colour name
             )

vis_1=go.Figure(data=[pie1])
vis_1.show()

Print the actual total number of deaths

In [22]:
# Actual total number of Deaths

print(summary.total_deaths.sum())
6088834.0
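
One way to keep the absolute values visible next to the percentages is to tabulate both. A minimal sketch with invented totals (the numbers below are placeholders, not the real figures):

```python
import pandas as pd

# Made-up totals, for illustration only.
totals = pd.Series({"Total Active": 500.0, "Total Recovered": 9000.0, "Total Deaths": 500.0})

shares = (totals / totals.sum() * 100).round(1)   # percentage of the whole
table = pd.DataFrame({"absolute": totals.astype(int), "share_%": shares})
print(table)
```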

Create Area Chart to show the top 10 countries with the most active cases

In [23]:
# Countries with the most active cases on the latest date in the data
latest = df_daily[df_daily.date == df_daily.date.max()]
top10 = latest.sort_values("active_cases", ascending=False).iloc[:10].country
fig = px.area(df_daily[df_daily.country.isin(top10)].sort_values("active_cases", ascending=False),
              x="date", y="active_cases", color="country", template="plotly_white")#, groupnorm='percent')

fig.update_traces(line={"width":2})
fig.update_layout(title = f"Top 10 Countries with Most Active Cases on {df_daily.date.max().strftime('%Y-%m-%d')}",
                  xaxis={"title": "Date"},
                  yaxis={"title":"Active Cases"})
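
The selection step in the cell above (take the latest date, rank by active cases, keep the full history of the top countries) can be sketched on toy data. `nlargest` is used here as an alternative to `sort_values(...).iloc[:n]`; the frame and country names are invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01"] * 3 + ["2022-01-02"] * 3),
    "country": ["A", "B", "C"] * 2,
    "active_cases": [5.0, 50.0, 10.0, 7.0, 40.0, 90.0],
})

latest = toy[toy["date"] == toy["date"].max()]                  # rows for the last day
top_countries = latest.nlargest(2, "active_cases")["country"].tolist()
top_history = toy[toy["country"].isin(top_countries)].sort_values(["country", "date"])
print(top_countries)
```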

Create Tree Map to show total deaths by country

In [24]:
vis_2 = px.treemap(summary, path=["country"], values="total_deaths", height = 500,
                 title="Total Coronavirus Deaths by Country",
                 color_discrete_sequence = px.colors.qualitative.Set1)
vis_2.show()

Create Tree Map to show total Confirmed cases by country

In [25]:
vis_3 = px.treemap(summary, path=["country"], values="total_confirmed", height = 500,
                 title="Total Confirmed Coronavirus Cases by Country",
                 color_discrete_sequence = px.colors.qualitative.Set1)
vis_3.show()

Choropleth map to show total death hotspots around the world

In [26]:
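summary['log(Total Deaths)'] = np.log2(summary['total_deaths'])
# total_deaths is a float column: cast to int first, otherwise str(7651.0) is grouped as "765,1.0"
summary['Total Deaths'] = summary['total_deaths'].apply(lambda x: count(str(int(x))) if pd.notna(x) else "N/A")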


vis_4 = px.choropleth(summary,
                        locations="country",
                        color="log(Total Deaths)",
                        locationmode = 'country names',
                        hover_name='country',
                        hover_data=['Total Deaths'],
                        color_continuous_scale='rdylgn_r',
                        title = '<b>Coronavirus Deaths Around The Globe</b>')

vis_4.update_layout(title_font_size=15,
                  margin={"r":20, "l":30},
                  coloraxis={"colorbar":dict(title="<b>Total Deaths</b><br>",
                                    tickvals=value,
                                    ticktext=log_scale)})
vis_4.show()

Choropleth map to show confirmed cases hotspots around the world

In [27]:
summary['log(Total Confirmed)'] = np.log2(summary['total_confirmed'])
summary['Total Confirmed'] = summary['total_confirmed'].apply(lambda x: count(str(x)))


vis_5 = px.choropleth(summary,
                        locations="country",
                        color="log(Total Confirmed)",
                        locationmode = 'country names',
                        hover_name='country',
                        hover_data=['Total Confirmed'],
                        color_continuous_scale='rdylgn_r',
                        title = '<b>Coronavirus Confirmed Cases Around The Globe</b>')

vis_5.update_layout(title_font_size=15,
                  margin={"r":20, "l":30},
                  coloraxis={"colorbar":dict(title="<b> Total Confirmed</b><br>",
                                    tickvals=value,
                                    ticktext=log_scale)})
vis_5.show()

Visualise Cumulative Total Deaths by Continent

In [28]:
summary.continent.unique()
Out[28]:
array(['Asia', 'Europe', 'Africa', 'North America', 'South America',
       'Australia/Oceania'], dtype=object)
In [29]:
def deathsbycontinent(continent):
    death_continent = df_daily[df_daily.continent == continent]
    death_continentdf = death_continent.dropna()
    vis_6 = px.line(death_continentdf, x="date", y="cumulative_total_deaths", color="country", #log_y=True,
                  line_group="country", hover_name="country", template="seaborn")

    annotations = []
    ann = []
    for label in vis_6.select_traces():
        ann.append(label.y[-1])
    y_scale = 0.155 / max(ann)
    for label in vis_6.select_traces():
        # labeling the right_side of the plot
        # max(..., 1) avoids math.log(0) for countries whose final death count is zero
        size = max(1, int(math.log(max(label.y[-1], 1), 1.1) * label.y[-1] * y_scale))
        annotations.append(dict(x=label.x[-1] + timedelta(hours=int((2 + size/5) * 24)), y=label.y[-1],
                                xanchor='left', yanchor='middle',
                                text=label.name,
                                font=dict(family='Arial',
                                size=7+int(size/2)),
                                showarrow=False))
        vis_6.add_trace(go.Scatter(
            x=[label.x[-1]],
            y=[label.y[-1]],
            mode='markers',
            name=label.name,
            marker=dict(color=label.line.color, size=size)
        ))
    vis_6.update_traces(line={'width':2})
    vis_6.update_layout(annotations=annotations, showlegend=True, uniformtext_mode='hide',
                      title=f"<b>Cumulative Total Coronavirus Deaths in {continent}<br>between {death_continentdf.date.min().strftime('%Y-%m-%d')} and {death_continentdf.date.max().strftime('%Y-%m-%d')}</b>",
                      xaxis={'title':'Date'},
                      yaxis={'title':'Coronavirus Confirmed Deaths'}
                      
                     )
    vis_6.show()
In [30]:
deathsbycontinent("Europe") # change name of continent here to analyse different results. 
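
The end-of-line markers in `deathsbycontinent` are sized by `log_1.1(y) * y * 0.155 / max(y)`, floored at 1, so the country with the highest final count gets the largest marker and small counts collapse to the minimum. A standalone sketch of that rule; `marker_sizes` is a hypothetical helper and the zero guard is an added assumption:

```python
import math

def marker_sizes(final_values, scale=0.155, base=1.1):
    """Return one marker size per final cumulative value."""
    y_scale = scale / max(final_values)
    sizes = []
    for y in final_values:
        if y <= 0:
            sizes.append(1)  # log is undefined at zero, fall back to the minimum size
            continue
        sizes.append(max(1, int(math.log(y, base) * y * y_scale)))
    return sizes

sizes = marker_sizes([0, 100, 10_000, 100_000])
print(sizes)
```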

Tree map to show breakdown of active cases by country

In [31]:
fig = px.treemap(summary, path=["country"], values="active_cases", height = 750,
                 title=f"<b>Active Cases Breakdown on {df_daily.date.max().strftime('%Y-%m-%d')}</b>",
                 color_discrete_sequence = px.colors.qualitative.Set1)

fig.update_traces(textinfo = "label+text+value") # show the label and value on each tile
fig.show()

Deep Dive into Different Countries' Patterns, Trends and Behaviour

In [32]:
# summary.country.unique()
In [33]:
def countrystat(country):
    if country in ["UK", "USA"]:
        prefix = "The "
    else:
        prefix = ""
    # Index by date on a new frame rather than modifying a slice of df_daily in place
    c = df_daily[df_daily.country == country].set_index('date')
    
    # 1. Cumulative total cases

    if not all(c.cumulative_total_cases.isna()):
        layout = go.Layout(yaxis={'range':[0, c.cumulative_total_cases.iloc[-1] * 1.05],'title':'Coronavirus Confirmed Cases'},xaxis={'title':''},)

        fig = px.area(c, x=c.index, y="cumulative_total_cases",
                      title=f"Cumulative Total Confirmed Cases in {prefix}{country} from {c.index[0].strftime('%Y-%m-%d')} till {c.index[-1].strftime('%Y-%m-%d')}", 
                      template='plotly')

        fig.update_traces(line={'width':5})
        fig.update_layout(layout)
        fig.show()
        
    # 2. Daily new cases with 7-day moving average

    if not all(c.daily_new_cases.isna()):
        layout = go.Layout(
            yaxis={'range':[0, c.daily_new_cases.max() * 1.05],'title':'Daily New Coronavirus Confirmed Cases'},
            xaxis={'title':''},
            template='plotly',
            title=f"Daily New Cases in {prefix}{country} from {c.index[0].strftime('%Y-%m-%d')} till {c.index[-1].strftime('%Y-%m-%d')} with a 7-day moving average",
        )

        moving_average = c.daily_new_cases.rolling(7).mean().dropna().astype(int)

        fig = go.Figure()
        fig.add_trace(go.Bar(name="Daily Cases", x=c.index, y=c.daily_new_cases, marker_color='black'))
        # Plot the average against its own dates so gaps in the data cannot shift the line
        fig.add_trace(go.Scatter(name="7-Day Moving Average", x=moving_average.index, y=moving_average, line={'width':3, 'color':'green'}))
        fig.update_layout(layout)
        fig.show()
    
    # 3. Daily new deaths with 7-day moving average

    if not all(c.daily_new_deaths.isna()):
        layout = go.Layout(
            yaxis={'range':[0, c.daily_new_deaths.max() * 1.05],
                  'title':'Daily New Coronavirus Deaths'},
            xaxis={'title':''},
            template='plotly',
            title=f"Daily Deaths in {prefix}{country} from {c.index[0].strftime('%Y-%m-%d')} till {c.index[-1].strftime('%Y-%m-%d')}",
            )

        moving_average = c.daily_new_deaths.rolling(7).mean().dropna().astype(int)

        fig = go.Figure()
        fig.add_trace(go.Bar(name="Daily Deaths", x=c.index, y=c.daily_new_deaths, marker_color='black'))
        # Plot the average against its own dates so gaps in the data cannot shift the line
        fig.add_trace(go.Scatter(name="7-Day Moving Average", x=moving_average.index, y=moving_average, line={'width':3, 'color':'red'}))

        fig.update_layout(layout)
        fig.show()
In [34]:
countrystat('Italy') # change country name here
In [35]:
# countrystat('Australia') # change country name here
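
The 7-day moving average used in `countrystat` comes from `rolling(7).mean()`: the first six days have no full window and are NaN, and the result keeps the dates as its index. A small sketch on a made-up ten-day series:

```python
import pandas as pd

idx = pd.date_range("2020-03-01", periods=10, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], index=idx, dtype=float)

# Each value is the mean of that day and the six days before it.
moving_average = daily.rolling(7).mean().dropna()
print(moving_average)
```

Because the index survives `dropna()`, `moving_average.index` gives the dates each average belongs to.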

Death Rate Relative to Population

In [36]:
sort = summary.sort_values(['total_deaths_per_1m_population'])
sort = sort[sort['total_deaths_per_1m_population'].notna()]
sort['% of Population with Coronavirus Death Cases'] = sort['total_deaths_per_1m_population']/1_000_000
mean = sort['% of Population with Coronavirus Death Cases'].mean()
sort['color'] = sort.apply(lambda row: "Red" if row['% of Population with Coronavirus Death Cases'] > mean else "Blue", axis=1)
#sorted_by_deaths_per_1m.dropna(inplace=True)
fig = px.scatter(sort, x='country', y='% of Population with Coronavirus Death Cases',
                 size='% of Population with Coronavirus Death Cases',
                 color='color',
                 title=f"<b>Coronavirus Death-Rate by Country as of {df_daily.date.max().strftime('%Y-%m-%d')}</b>",
                 height=650)

fig.update_traces(marker_line_color='rgb(75,75,75)',
                  marker_line_width=1.5, opacity=0.8,
                  hovertemplate="<b>%{x}</b><br>%{y:.3%} of Population with Death Cases<extra></extra>",)
fig.update_layout(showlegend=False,
                 yaxis={"tickformat":".3%", "range":[0,sort['% of Population with Coronavirus Death Cases'].max() * 1.1]},
                 xaxis={"title": ""},
                 title_font_size=20)


callout = ["China", "Australia", "India", "South Africa", "Russia", "Italy","Brazil", "UK", "France", "USA",  "Bulgaria", "Peru"]

for i, country in enumerate(callout):
    ay = 30 if i%2 else -30
    ax = 20
    if country == "Russia": ax = -20
    if country == "Czech Republic": ay, ax = -30, -60
    if country == "USA": ay = 50
    if country == "Italy": ay, ax = 30, -20
    if country == "UK": ay, ax = -30, 40
    if country == "Australia": ay = -30
    if country == "France": ay, ax = -60, -40
    if country == "Brazil": ax = -20
    if country == "Peru": ay = -30
    fig.add_annotation(
            x=country,
            y=sort['% of Population with Coronavirus Death Cases'][sort.index[sort.country==country][0]],
            xref="x",
            yref="y",
            text=country,
            showarrow=True,
            font=dict(
                family="Courier New, monospace",
                size=14,
                color="#ffffff"
                ),
            align="center",
            arrowhead=2,
            arrowsize=1,
            arrowwidth=2,
            arrowcolor="#636363",
            ax=ax,
            ay=ay,
            bordercolor="#c7c7c7",
            borderwidth=2,
            borderpad=4,
            bgcolor=sort['color'][sort.index[sort.country==country][0]],
            opacity=0.6
            )

fig.add_shape(type='line',
              x0=sort['country'].iloc[0], y0=mean,
              x1=sort['country'].iloc[-1], y1=mean,
              line=dict(color='black',width=1),
              xref='x', yref='y'
             )
fig.add_annotation(x=sort['country'].iloc[0], y=mean,
                   text=f"mean = {mean*100:.2f}%",
                   showarrow=False,
                   xanchor="left",
                   yanchor="bottom",
                   font={"color":"black", "size":14}
                  )
                  
fig.show()
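
The conversion behind this chart is that deaths per 1 million divided by 1,000,000 gives the fraction of the population, which the `.3%` tick format then displays as a percentage. A sketch using the per-million values from the first three rows of the summary output shown earlier (Afghanistan, Albania, Andorra); the column name `death_fraction` and the use of `np.where` for the colour flag are illustrative choices:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Andorra"],
    "total_deaths_per_1m_population": [189.0, 1214.0, 1975.0],
})

toy["death_fraction"] = toy["total_deaths_per_1m_population"] / 1_000_000
mean_fraction = toy["death_fraction"].mean()
# Flag countries above the mean of this three-row sample.
toy["color"] = np.where(toy["death_fraction"] > mean_fraction, "Red", "Blue")

print(toy.assign(death_pct=toy["death_fraction"].map("{:.3%}".format)))
```

The mean here is taken over the three sample rows only, so the colours need not match those in the full chart.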